A Multitask Learning Approach to Document Representation using Unlabeled Data
Authors
Abstract
Text categorization is intrinsically a supervised learning task that aims to relate a given text document to one or more predefined categories. Unfortunately, labeling such databases of documents is a painstaking task. In this paper we present a method that takes advantage of the huge amounts of unlabeled text documents available in digital form to counterbalance the relatively small amount of labeled text documents. A Siamese MLP is trained in a multi-task framework to solve two concurrent tasks: using the unlabeled data, we search for a mapping from the documents' bag-of-words representation to a new feature space that emphasizes similarities and dissimilarities among documents; simultaneously, this mapping is constrained to also give good text categorization performance on the labeled dataset. Experimental results on Reuters RCV1 suggest that, as expected, performance on the labeled task increases as the amount of unlabeled data increases.
IDIAP-RR 06-44
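The two concurrent objectives described above can be sketched as a single weighted loss over a shared mapping: a Siamese contrastive term on document pairs plus a classification term on labeled documents. The following minimal numpy sketch is illustrative only; the layer sizes, random toy documents, pairing rule, margin, and loss weight `lam` are all assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: vocabulary, hidden units, number of categories.
V, H, C = 50, 16, 3

# Shared MLP parameters: both tasks reuse the mapping W1/b1.
W1 = rng.normal(0, 0.1, (V, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, C)); b2 = np.zeros(C)  # classifier head

def embed(x):
    """Map a bag-of-words vector to the shared feature space."""
    return np.tanh(x @ W1 + b1)

def contrastive_loss(xa, xb, similar, margin=1.0):
    """Siamese term on a document pair: pull similar pairs together,
    push dissimilar pairs at least `margin` apart in feature space."""
    d = np.linalg.norm(embed(xa) - embed(xb))
    return d**2 if similar else max(0.0, margin - d)**2

def classification_loss(x, y):
    """Cross-entropy of the classifier head on a labeled document."""
    z = embed(x) @ W2 + b2
    p = np.exp(z - z.max()); p /= p.sum()
    return -np.log(p[y])

# Toy data: one unlabeled pair (assumed similar) and one labeled document.
xa = rng.poisson(1.0, V).astype(float)
xb = rng.poisson(1.0, V).astype(float)
xl, yl = rng.poisson(1.0, V).astype(float), 1

# Multitask objective: weighted sum of both losses over the shared mapping.
lam = 0.5
total = contrastive_loss(xa, xb, similar=True) + lam * classification_loss(xl, yl)
print(total > 0)
```

In training, gradients of `total` with respect to `W1`/`b1` would blend both signals, so the feature space is shaped by the abundant unlabeled pairs while staying useful for categorization.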
Similar resources
Deep Unsupervised Domain Adaptation for Image Classification via Low Rank Representation Learning
Domain adaptation is a powerful technique when a large amount of labeled data with similar attributes is available in different domains. In real-world applications there are huge amounts of data, but most of them are unlabeled. Domain adaptation is effective for image classification, where obtaining adequate labeled data is expensive and time-consuming. We propose a novel method named DALRRL, which consists of deep ...
The Benefit of Multitask Representation Learning
We discuss a general method for learning data representations from multiple tasks. We provide a justification for this method in both the multitask learning and learning-to-learn settings. The method is illustrated in detail in the special case of linear feature learning. Conditions on the theoretical advantage offered by multitask representation learning over independent task learning are establish...
Automatic word lemmatization
This paper addresses the problem of automatic word lemmatization using machine learning techniques. We illustrate how sequential modeling can be used to improve classification results, in particular by modeling sub-problems that mostly have fewer than 10 class values instead of addressing all 156 class values in one problem. We independently induced two models for automatic lem...
Image Classification via Sparse Representation and Subspace Alignment
Image representation is a crucial problem in image processing, where many low-level representations of an image exist, e.g., SIFT, HOG, and so on. But there is a missing link between low-level and high-level semantic representations. In fact, traditional machine learning approaches, e.g., non-negative matrix factorization, sparse representation, and principal component analysis, are employed to d...
Semi-Supervised Representation Learning based on Probabilistic Labeling
In this paper we present a new algorithm for semi-supervised representation learning. The algorithm is based on assigning class probabilities to unlabeled data. The approach uses the Hilbert-Schmidt Independence Criterion (HSIC) to find a mapping that takes the data to a lower-dimensional space. We call this algorithm SSRL-PL. Use of unlabeled data for learning is not always beneficial and there...
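The HSIC mentioned in this abstract has a simple empirical estimator: with Gram matrices K and L over two sample sets and centering matrix H = I - (1/n)11^T, it is trace(KHLH)/(n-1)^2. A small numpy sketch under assumed Gaussian kernels (the kernel choice and bandwidth are illustrative, not taken from the paper):

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Empirical Hilbert-Schmidt Independence Criterion between X and Y
    (rows = paired observations), using Gaussian kernels:
    trace(K H L H) / (n-1)^2 with centering matrix H = I - 11^T/n."""
    n = X.shape[0]
    def gram(Z):
        sq = np.sum(Z**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
        return np.exp(-d2 / (2 * sigma**2))
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = gram(X), gram(Y)
    return np.trace(K @ H @ L @ H) / (n - 1)**2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
dep = hsic(X, X + 0.01 * rng.normal(size=(100, 2)))  # strongly dependent pair
ind = hsic(X, rng.normal(size=(100, 2)))             # independent draw
print(dep > ind)
```

Larger HSIC values indicate stronger statistical dependence, which is what makes it usable as a criterion when searching for a low-dimensional mapping that stays dependent on the (probabilistic) labels.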
Publication date: 2006